DOCUMENT RESUME

AUTHOR        Reckase, Mark D.
TITLE         Some Decision Procedures for Use with Tailored Testing.
PUB DATE      Jun 79
NOTE          Paper presented at the Meeting of Computer Assisted
              Testing (Minneapolis, MN, June, 1979)

EDRS PRICE    MF01/PC02 Plus Postage.
DESCRIPTORS   Bayesian Statistics; *Criterion Referenced Tests;
              Decision Making; Guessing (Tests); *Mastery Tests;
              Measurement Techniques; Probability; *Psychometrics;
              Test Interpretation
IDENTIFIERS   Computer Assisted Testing; *Tailored Testing
ABSTRACT

This paper describes two procedures for making binary
classification decisions using tailored testing: the sequential
probability ratio test (SPRT) and a Bayesian decision procedure. The
first procedure described, the SPRT, was developed by Wald for
quality control work. It has not been widely applied for testing
applications because the assumption of an equal probability of a
correct response was made to facilitate the derivation of the
operating characteristic (OC) and average sample number (ASN)
functions. The results of the application of the SPRT with a
simulated procedure are described. The second decision procedure, the
Bayesian procedure, includes a prior distribution of student
achievement, a loss function for incorrect decisions, and the cost of
observations in the development of the decision rule. The basic
philosophy of this procedure is to administer items until the
expected loss incurred in making a decision is less than the expected
loss after the next item is administered plus the cost of
administration. This procedure is not yet operational for making
decisions under tailored testing because appropriate loss functions
for educational decisions have not been determined. (Author/BW)






Some Decision Procedures 
for Use with Tailored Testing 
by 

Mark D. Reckase
University of Missouri-Columbia






There are many applications of testing technology that require that
decisions be made as to whether a person is above or below a criterion
score. Accepting a candidate into a program is an example of such a
decision. Criterion-referenced testing and its special case, mastery
testing, are other areas that require similar classifications. In the
criterion-referenced testing application, it would be especially useful
if the decisions could be made quickly and conveniently for each student
in an individualized instruction program. The recently developed technology
of tailored testing (Lord, 1970) has the potential to fulfill the
requirements of such a testing system. However, no generally accepted
procedure exists for making classification decisions using tailored
testing, probably because these testing techniques are still relatively
new. The few procedures that do exist are either based on randomly
sampling items (Epstein, 1978; Sixtl, 1974), which does not take advantage
of the power of tailored testing, or on heuristic techniques (Weiss, 1978)
that do not have a sound theoretical base. The purpose of this paper is
to present some sequential decision procedures that can easily be applied
to tailored testing without losing any of the elegance and mathematical
sophistication of the examination procedures.



Paper presented at the meeting of Computer Assisted Testing '79, Minneapolis, 
June, 1979. This research was supported by Contract Number N00014-77-C0097 
from the Personnel and Training Research Programs of the Office of Naval 
Research. 



Numerous tailored (adaptive, response contingent, sequential, etc.)
testing procedures now exist in the research literature, ranging from simple
two-stage procedures (Betz and Weiss, 1973) to complex Bayesian procedures
(Owen, 1969). Weiss (1974) has written a good review of the tailored
testing procedures that have been developed up until 1974. Although many
procedures exist, for the purposes of this paper only tailored testing
procedures using item characteristic curve (ICC) theory and maximum-likelihood
ability estimation will be considered. It is also assumed that the tests
are administered to the examinees by computer using some type of computer
terminal, and that items are selected to maximize the value of the
information function at the previous ability estimate. Despite the narrow
definition of tailored testing used for this paper, the results should
generalize to any procedure based upon item characteristic curve theory.

In applying the decision procedures discussed in this paper, two
specific ICC models will be used: the one- and the three-parameter logistic
models. These models were selected because of their frequent appearance
in the research literature and because of the existence of readily available
calibration (LOGIST, CALFIT) and tailored testing programs (Reckase, 1974).
Any other ICC model could just as easily be used.

Sequential Decision Procedures

A cursory review of the statistical literature quickly indicates
that much has been written about sequential estimation and classification
procedures. Although these procedures are somewhat more obscure than ANOVA
and regression procedures, most intermediate-level mathematical statistics
books include at least one chapter on sequential analysis (see Brunk, 1965,
Chapter 16, for example). In an ongoing review of the extensive literature
that exists on this topic (which has accumulated over 200 references), it
has been found that most procedures fall into one of three categories:
sequential probability ratio tests (SPRT) (Wald, 1947), Bayesian sequential
procedures (e.g., DeGroot, 1970), and curtailed single sampling plans
(Dodge and Romig, 1929). Of these procedures, only the SPRT is narrowly
specified; the other two refer to families of procedures rather than a
single technique.

Although these statistical procedures are widely applied for quality
control, little use has been made of them in the area of mental testing,
probably because operable sequential testing procedures did not exist
until recently. Since all references in the testing literature to sequential
decisions discovered to date have used the SPRT (Sixtl, 1974; Epstein,
1978; Reckase, 1978), that procedure will be described first, followed
by the Bayesian procedure. The curtailed sampling plans will not be
discussed in this paper because they cannot be readily applied to the
commonly used tailored testing procedures.

The Sequential Probability Ratio Test (SPRT)

The sequential probability ratio test was initially developed by
Abraham Wald as a quality control device for use by the Armed Forces during
World War II. Since he has written an excellent book on the subject (Wald,
1947) and since this procedure was clearly described at the last meeting
of this conference (Epstein, 1978), the procedure will be only briefly
described here. It is not the purpose of this paper to duplicate the
efforts of Epstein, but rather to generalize the procedure so that it will
more directly apply to tailored testing.

Wald originally developed the SPRT as a statistical test to decide
which of two simple hypotheses is more correct. For example, it might
be interesting to determine whether a student can answer correctly 60%
or 80% of the items in an item pool. The basic philosophy behind the
procedure used to decide between these two alternatives was to determine
the likelihood of an observed response to an item under the two alternative
hypotheses. If the likelihood were sufficiently larger for one
hypothesis than the other, that hypothesis would be accepted. If the
two likelihoods were similar, another observation would be taken. Wald
(1947) has shown that one hypothesis will always be selected over another
using a finite set of items.

To demonstrate this procedure, suppose an item is randomly selected
from an item pool and administered to a student. If a correct response
were obtained, the likelihood under $H_1$ (80% knowledge) would be .80, and
the likelihood under $H_0$ (60% knowledge) would be .60. To evaluate these
likelihoods, Wald takes the ratio of the two:

$$\frac{L(x = 1 \mid H_1)}{L(x = 1 \mid H_0)} = \frac{.80}{.60} \qquad (1)$$

If the ratio is sufficiently large, $H_1$ is accepted; if it is sufficiently
small, $H_0$ is accepted; and if it is near 1.0 another observation is taken.
The values of this ratio that are considered sufficiently large or small
depend upon what is considered acceptable for the two possible decision
errors: (a) accepting $H_1$ when $H_0$ is true ($\alpha$ error); and (b)
accepting $H_0$ when $H_1$ is true ($\beta$ error). Wald (1947) showed that
the decision points can be approximated by

$$\text{lower decision point} = B = \frac{\beta}{1 - \alpha} \qquad (2)$$

$$\text{upper decision point} = A = \frac{1 - \beta}{\alpha} \qquad (3)$$



Thus, if the likelihood ratio is less than or equal to $B$, $H_0$ is accepted
with error probability approximately $\beta$. If the likelihood ratio is
greater than or equal to $A$, $H_1$ is accepted with error probability
approximately $\alpha$. If the ratio is between $B$ and $A$, another item
should be randomly sampled and administered, and the decision rule
implemented again. If $\alpha = .05$ and $\beta = .10$, for example, the
decision points would be at $B = .105$ and $A = 18$. Since the likelihood
ratio for a correct response (.80/.60 = 1.33) is between these two values,
no decision would be made, and another item would be selected and
administered.
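The single-item example above can be checked with a short sketch. The function names `wald_bounds` and `sprt_decision` are illustrative choices, not part of the original procedure description; only the formulas for the two decision points and the comparison rule come from the text.

```python
from typing import Tuple


def wald_bounds(alpha: float, beta: float) -> Tuple[float, float]:
    """Return (B, A), Wald's approximate lower and upper decision points."""
    lower = beta / (1.0 - alpha)   # B = beta / (1 - alpha)
    upper = (1.0 - beta) / alpha   # A = (1 - beta) / alpha
    return lower, upper


def sprt_decision(ratio: float, lower: float, upper: float) -> str:
    """Compare a likelihood ratio L(x|H1)/L(x|H0) with the decision points."""
    if ratio >= upper:
        return "accept H1"
    if ratio <= lower:
        return "accept H0"
    return "continue"


# Example from the text: alpha = .05, beta = .10, one correct response
# with likelihood .80 under H1 and .60 under H0.
B, A = wald_bounds(alpha=0.05, beta=0.10)
ratio = 0.80 / 0.60
print(round(B, 3), round(A, 1), sprt_decision(ratio, B, A))
```

With these values the ratio falls between the two decision points, so the sketch reports that testing should continue.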

Since the responses to the items follow a binomial distribution in
this example, a general expression for the likelihood ratio can be developed
for the administration of n items:

$$\frac{L(x_1, x_2, \ldots, x_n \mid H_1)}{L(x_1, x_2, \ldots, x_n \mid H_0)} = \frac{p_1^{\sum x_i}\,(1 - p_1)^{n - \sum x_i}}{p_0^{\sum x_i}\,(1 - p_0)^{n - \sum x_i}} \qquad (4)$$

where $x_i$ is the score on item $i$ (0 or 1), $p_0$ is the proportion of items
known by the student in the item pool under $H_0$, and $p_1$ is the proportion
known in the item pool under $H_1$. If

$$\frac{L(x_1, \ldots, x_n \mid H_1)}{L(x_1, \ldots, x_n \mid H_0)} \geq A, \quad \text{accept } H_1. \qquad (5)$$

If

$$\frac{L(x_1, \ldots, x_n \mid H_1)}{L(x_1, \ldots, x_n \mid H_0)} \leq B, \quad \text{accept } H_0. \qquad (6)$$

Otherwise, continue administering items.
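The sequential rule for the binomial case can be sketched as a loop over the item scores. The function `binomial_sprt` is a hypothetical name, and working with the logarithm of the likelihood ratio is an implementation convenience assumed here, not something the original text specifies; the decisions are the same as comparing the ratio itself with A and B.

```python
import math
from typing import Iterable, Tuple


def binomial_sprt(responses: Iterable[int], p0: float, p1: float,
                  alpha: float, beta: float) -> Tuple[str, int]:
    """Apply the binomial SPRT to 0/1 item scores, one item at a time.

    Returns the decision and the number of items used.  The decision is
    "continue" if the scores run out before a decision point is reached.
    """
    log_a = math.log((1.0 - beta) / alpha)   # log of upper decision point A
    log_b = math.log(beta / (1.0 - alpha))   # log of lower decision point B
    log_ratio = 0.0
    n = 0
    for n, x in enumerate(responses, start=1):
        if x == 1:
            log_ratio += math.log(p1 / p0)
        else:
            log_ratio += math.log((1.0 - p1) / (1.0 - p0))
        if log_ratio >= log_a:
            return "accept H1", n
        if log_ratio <= log_b:
            return "accept H0", n
    return "continue", n


# A student who misses every item is classified under H0 after a few items.
print(binomial_sprt([0, 0, 0, 0, 0, 0], p0=0.6, p1=0.8, alpha=0.05, beta=0.10))
```

Because an incorrect response is much less likely under the 80% hypothesis than a correct response is under the 60% hypothesis, a run of errors reaches the lower decision point far sooner than a run of correct answers reaches the upper one.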

Although this procedure was originally developed to test simple
hypotheses, Wald (1947) has shown that the procedure operates in the same
way for composite hypotheses. For example, suppose it was desirable to
know whether a student knew more than some proportion, $p_c$, of the items
in an item pool. In order to use the SPRT to make this decision, a region
must first be selected around $p_c$ for which it does not matter which
decision is made, say $p_0 < p_c < p_1$. If $p_0$ is close to $p_1$, a very
precise decision is required. If $p_0$ and $p_1$ define a wide indifference
region around $p_c$, a rather gross decision rule is all that is needed. The
SPRT is then carried out in exactly the same fashion as above, using $p_0$
and $p_1$ as the values for hypotheses $H_0$ and $H_1$, respectively. When the
decision points $A$ and $B$ are computed as above, the error rates, $\alpha$
and $\beta$, hold for true values of $p$ at $p_0$ and $p_1$. For true values of
$p$ more extreme than $p_0$ or $p_1$ the error rates are lower.

In order to evaluate the properties of the SPRT, two functions have
been derived: the operating characteristic (OC) function and the average
sample number (ASN) function. The OC function is defined as the probability
of accepting hypothesis $H_0$ as a function of the true proportion of the
item pool known by the student. Although the derivation of the OC function
is somewhat complex, the function can be approximated by the following
two formulas (Wald, 1947).

$$p = \frac{1 - \left(\frac{1 - p_1}{1 - p_0}\right)^h}{\left(\frac{p_1}{p_0}\right)^h - \left(\frac{1 - p_1}{1 - p_0}\right)^h} \qquad (7)$$

$$L(p) = \frac{A^h - 1}{A^h - B^h} \qquad (8)$$

These equations are used by substituting in various arbitrary values of
$h$ and solving for $p$ and $L(p)$. $L(p)$, the probability of accepting $H_0$,
is then plotted against $p$ to describe the OC function. Figure 1 shows
an OC function for $\alpha = .05$, $\beta = .10$, $p_0 = .6$, and $p_1 = .8$. Note
that at $p = p_0$ the height of the curve is equal to $1 - \alpha$, and at
$p = p_1$ the height of the curve is equal to $\beta$. Note that the OC function
is only dependent upon $\alpha$, $\beta$, $p_0$, and $p_1$. Also, the steeper
the curve, the more accurate is the SPRT decision rule.

Insert Figure 1 about here

The ASN function is defined as the expected number of items required
to make a decision at the various values of the true proportion of known
items, $E(n \mid p)$. The formula for the ASN function for the binomial case
described above is

$$E(n \mid p) = \frac{L(p) \ln B + \left[1 - L(p)\right] \ln A}{p \ln\frac{p_1}{p_0} + (1 - p) \ln\frac{1 - p_1}{1 - p_0}} \qquad (9)$$

where all of the symbols are as described above and the logarithms are to
the base e. Figure 1 also shows the ASN function for the example presented
above. Note that the ASN function is highest between the points $p_0$ and
$p_1$, and the closer together the values of $p_0$ and $p_1$ are, the higher the
curve in that region. In general, the lower the ASN curve, the more
efficient the decision rule.
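Points on the OC and ASN curves for the binomial case can be traced numerically. The sketch below assumes Wald's (1947) standard approximations, in which an auxiliary value h is chosen, the corresponding p and L(p) are computed, and E(n|p) follows from them; the function name `oc_asn_point` and the particular grid of h values are illustrative assumptions.

```python
import math
from typing import Tuple


def oc_asn_point(h: float, p0: float, p1: float,
                 alpha: float, beta: float) -> Tuple[float, float, float]:
    """Return (p, L(p), E(n|p)) for one nonzero value of the auxiliary h.

    h = 0 is excluded because both approximations become 0/0 there.
    """
    a = (1.0 - beta) / alpha          # upper decision point A
    b = beta / (1.0 - alpha)          # lower decision point B
    r_correct = (p1 / p0) ** h
    r_wrong = ((1.0 - p1) / (1.0 - p0)) ** h
    p = (1.0 - r_wrong) / (r_correct - r_wrong)
    oc = (a ** h - 1.0) / (a ** h - b ** h)      # probability of accepting H0
    # Expected change in the log likelihood ratio per item at true proportion p.
    mean_step = (p * math.log(p1 / p0)
                 + (1.0 - p) * math.log((1.0 - p1) / (1.0 - p0)))
    asn = (oc * math.log(b) + (1.0 - oc) * math.log(a)) / mean_step
    return p, oc, asn


# Example values from the text: alpha = .05, beta = .10, p0 = .6, p1 = .8.
for h in (-2.0, -1.0, -0.5, 0.5, 1.0, 2.0):
    p, oc, asn = oc_asn_point(h, 0.6, 0.8, 0.05, 0.10)
    print(f"h={h:+.1f}  p={p:.3f}  L(p)={oc:.3f}  E(n|p)={asn:.1f}")
```

Setting h = 1 returns the point at p = p0, where the OC value is 1 - alpha, and h = -1 returns the point at p = p1, where the OC value is beta, which gives a quick check on the computation.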

Although the SPRT as defined above is a valuable procedure for decision
making in many situations, it makes an implicit assumption that limits
its usefulness for tailored testing. The model as presented assumes that
the probability of a correct response is the same for all items in the
pool. This assumption is reasonable if items are randomly selected and
p is the proportion of the items that a student can answer correctly,
but it is not reasonable if items are selected to maximize information
at an ability level. Under the tailored testing model assumed by this
paper, the probability of a correct response changes with each item,
requiring a modification of the model.

Fortunately, a detailed analysis of Wald's (1947) work indicates 
that the sequential random sample assumption is not necessary for the 
application of the SPRT, but is needed only for the derivation of the 
OC and ASN functions. The SPRT can then be directly applied to tailored 
testing, but the OC and ASN functions must be determined in a different 
manner. One approach to determining these functions will be presented 
later. 






To demonstrate the application of the SPRT to tailored testing as
defined by this paper, suppose that a tailored test is being used to
determine whether a student has exceeded the criterion specified for a
criterion-referenced test. Although the method for selecting this criterion
is currently not well specified, assume that a value, $\theta_c$, has been
determined and that students above this value on the latent achievement
scale pass the unit, while those below are given more instruction.

In order to use the SPRT, a region must be specified around $\theta_c$ for
which it does not matter whether a pass or a fail decision is made. If
high accuracy is desired for the decision rule, a narrow indifference
region must be specified, but more items will be required to make the
decision. As the region gets wider, the decision accuracy declines, but
fewer items are required. Values of $\theta_0$ and $\theta_1$ mark the boundaries
of this indifference region ($\theta_0 < \theta_c < \theta_1$). Once these values
have been selected, the likelihood ratio can be defined as

$$\frac{L(x_1, \ldots, x_n \mid \theta_1)}{L(x_1, \ldots, x_n \mid \theta_0)} = \frac{\prod_{i=1}^{n} P_i(\theta_1)^{x_i}\, Q_i(\theta_1)^{1 - x_i}}{\prod_{i=1}^{n} P_i(\theta_0)^{x_i}\, Q_i(\theta_0)^{1 - x_i}} \qquad (10)$$

where $L(x_1, \ldots, x_n \mid \theta_k)$, $k = 0, 1$, is the likelihood of the
student's response string for the n items administered so far, $x_i$ is the
0, 1 score on item $i$, $P_i(\theta_k)$ is the probability of a correct response
to item $i$ assuming ability $\theta_k$ determined from the appropriate ICC
model, and $Q_i(\theta_k) = 1 - P_i(\theta_k)$.

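If the one-parameter logistic model is used as a basis for the tailored
testing procedure, Equation 10 becomes

$$\frac{L(x_1, \ldots, x_n \mid \theta_1)}{L(x_1, \ldots, x_n \mid \theta_0)} = \prod_{i=1}^{n} \frac{e^{x_i(\theta_1 - b_i)} \big/ \left(1 + e^{\theta_1 - b_i}\right)}{e^{x_i(\theta_0 - b_i)} \big/ \left(1 + e^{\theta_0 - b_i}\right)} \qquad (11)$$

where $b_i$ is the difficulty parameter for item $i$. Equation 11 can be
simplified to

$$\frac{L(x_1, \ldots, x_n \mid \theta_1)}{L(x_1, \ldots, x_n \mid \theta_0)} = e^{(\theta_1 - \theta_0) \sum x_i} \prod_{i=1}^{n} \frac{1 + e^{\theta_0 - b_i}}{1 + e^{\theta_1 - b_i}} \qquad (12)$$

The values of this likelihood ratio can then be used to test whether the
student is above or below $\theta_c$ using the same method presented earlier.
If the ratio is greater than $A = (1 - \beta)/\alpha$, the student is classified
as being above $\theta_c$; if it is below $B = \beta/(1 - \alpha)$, the student is
classified below the criterion; otherwise another item is administered. If
the three-parameter logistic model is the basis for the tailored testing
procedure, the SPRT procedure is applied in exactly the same manner as
above, except

$$P_i(\theta_k) = c_i + (1 - c_i)\,\frac{e^{D a_i(\theta_k - b_i)}}{1 + e^{D a_i(\theta_k - b_i)}}$$

is used in Equation 10 instead of the simple logistic form, where $a_i$,
$b_i$, and $c_i$ are the discrimination, difficulty, and pseudo-guessing
parameters of item $i$ and $D = 1.7$.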
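A minimal sketch of the likelihood ratio for tailored testing follows. It assumes the usual forms of the one- and three-parameter logistic models (with D = 1.7 in the three-parameter case); the function names, the item difficulties, and the response string are hypothetical values chosen only for illustration. The general product form and the simplified one-parameter form are computed on the log scale and should agree.

```python
import math
from typing import Sequence

D = 1.7  # scaling constant used with the three-parameter model


def p_1pl(theta: float, b: float) -> float:
    """One-parameter logistic probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def p_3pl(theta: float, a: float, b: float, c: float) -> float:
    """Three-parameter logistic probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def log_ratio(scores: Sequence[int], probs0: Sequence[float],
              probs1: Sequence[float]) -> float:
    """Log of the general likelihood ratio for any ICC model.

    probs0[i] and probs1[i] are P_i(theta_0) and P_i(theta_1).
    """
    total = 0.0
    for x, q0, q1 in zip(scores, probs0, probs1):
        total += math.log(q1 / q0) if x == 1 else math.log((1 - q1) / (1 - q0))
    return total


def log_ratio_1pl(scores: Sequence[int], difficulties: Sequence[float],
                  theta0: float, theta1: float) -> float:
    """Log of the simplified one-parameter likelihood ratio."""
    total = (theta1 - theta0) * sum(scores)
    for b in difficulties:
        total += math.log((1 + math.exp(theta0 - b)) / (1 + math.exp(theta1 - b)))
    return total


# Hypothetical response string and item difficulties.
scores = [1, 0, 1, 1]
difficulties = [0.0, 0.4, -0.2, 0.1]
theta0, theta1 = -0.8, 0.8
general = log_ratio(scores,
                    [p_1pl(theta0, b) for b in difficulties],
                    [p_1pl(theta1, b) for b in difficulties])
print(general, log_ratio_1pl(scores, difficulties, theta0, theta1))
```

For the three-parameter model the same `log_ratio` function applies once `p_3pl` is used to fill `probs0` and `probs1`; no simplification comparable to the one-parameter case is available because of the guessing parameter.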

The evaluation of the OC and ASN functions cannot be performed as
easily as for the simple binomial model due to the presence of the item
parameters in the formula for computing the probability of a correct response.
Since the item parameters for the next item to be administered are dependent
on the item pool used and the responses to the previous items, the
derivation of these functions depends on a complex string of conditional
expectations. The conditional probabilities involved make the derivation of
these functions, for all practical purposes, impossible. Therefore the OC
and ASN functions can only be approximated using simulation techniques, but
these approximations should be adequate for most purposes. Some OC and ASN
functions for tailored tests based on the one- and three-parameter logistic
models will be presented later in this paper. Note, however, that although
the full OC function cannot be derived, the value of the function is equal
to $1 - \alpha$ at $\theta_0$ and $\beta$ at $\theta_1$, assuming that the item
parameters are known. Since in all cases except simulation the item
parameters are only estimated, in reality these two points are not known
either.

Bayesian Sequential Decision Procedure

The Bayesian decision procedure is an alternative to the SPRT for
deciding whether or not a student has exceeded the criterion, $\theta_c$.
Although this procedure is much more complicated than the SPRT, it has the
capability of using additional information in making the decision. This
added information may improve the decision process. In order to describe
this procedure, some basic concepts will first be defined.

Initially, it is assumed that a population of students exists such
that each student has some definable achievement level, $\theta$. Individual
achievement levels are labeled $\theta_i$. Each person is to be tested and a
decision is to be made concerning placement above or below the criterion.
The decision to place above the criterion score is labeled $d_1$ and the
decision to place below the criterion score is $d_2$.



In order to decide upon a decision rule using Bayesian methodology,
three pieces of information are required in advance. These are (a) a
prior distribution of $\theta$, (b) a loss function relating the achievement
levels to the decisions, and (c) the cost of each observation. Using
these three types of information, a decision rule (technique for selecting
a decision) and a stopping rule (technique for deciding when a decision
should be made) can be determined.

The basic concept used in choosing a decision rule is the concept
of risk. Risk is defined as the expected loss given a decision. Obviously,
the decision that minimizes the risk is the desired one. When a Bayesian
prior is used, this minimum risk is called the Bayes risk.

The stopping rule used with the Bayesian sequential decision procedure
is also based upon the Bayes risk concept. If the expected risk
after taking another observation plus the cost of the observation is less
than the risk before the observation is taken, the sampling should go on.
However, if the expected risk plus cost of a new observation is greater
than the risk without the observation, then sampling should cease. In
some cases, it is best not to take any observations at all because the
expected risk plus the cost of an observation is greater than the initial
risk of a guess based on the prior distribution of achievement.

Based on this framework, theorems have been proven that show that
an optimal procedure exists, and that the optimal procedure will reach
a decision after some finite number of observations (DeGroot, 1970). If
the risk decreases with each observation, the procedure is called a regular
sequential decision procedure. Only regular procedures will be considered
here since it is assumed that each item administered yields some positive
information rather than providing some misinformation.

In order to make the description of this procedure easier to follow,
a simplified example will now be presented. Although this example is not
realistic, it demonstrates the basic concepts without requiring complicated
mathematical expressions. The extension of the procedure to realistic
situations is direct, but the mathematics is cumbersome. Suppose that
two types of individuals exist in the population of interest, those with
$\theta_1 = -.8$ and those with $\theta_2 = +.8$ on a latent achievement
dimension. A tailored test is to be used to classify the individuals into
two groups, those above and those below the criterion score, 0.0. Thus, two
decisions are possible: classify as $d_1$, above the criterion; and $d_2$,
below the criterion.

If persons with ability -.8 are classified above the criterion, a
loss of 25 is incurred in each case. If they are classified below the
criterion, there is no loss. If persons with ability .8 are classified
above the criterion, there is no loss, while a loss of 15 is incurred
for each person classified below the criterion. This loss function is
summarized below. It should be noted that these loss function values are
totally arbitrary.

                        Loss Function

                             Ability
Decision                  -.8        +.8
d_1 (above criterion)      25          0
d_2 (below criterion)       0         15



Suppose that the prior belief that a randomly selected person has
ability .8 is .6 and that he/she has ability -.8 is .4. Then the first
step in using a Bayesian sequential decision process is to determine the
risk associated with $d_1$ and $d_2$ when no observations are taken. The
expected loss (risk) if decision $d_1$ is picked is

$$E(\text{loss} \mid d_1) = P(\theta_1)\,l(d_1 \mid \theta_1) + P(\theta_2)\,l(d_1 \mid \theta_2) = .4 \times 25 + .6 \times 0 = 10,$$

where $P(\theta_i)$ is the prior probability of $\theta_i$ and
$l(d_j \mid \theta_i)$ is the loss from picking decision $d_j$ when $\theta_i$
is true. The expected loss (risk) if $d_2$ is picked is

$$E(\text{loss} \mid d_2) = P(\theta_1)\,l(d_2 \mid \theta_1) + P(\theta_2)\,l(d_2 \mid \theta_2) = .4 \times 0 + .6 \times 15 = 9.$$

Thus the Bayes decision when no observation is taken is $d_2$, and the Bayes
risk is 9. The decision $d_2$ is chosen because it has the lower risk.
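The no-observation risks can be reproduced with a few lines. The dictionary layout and the function name `expected_losses` are assumptions made for this sketch; the prior probabilities and loss values are the arbitrary ones used in the example.

```python
from typing import Dict


def expected_losses(belief: Dict[float, float],
                    loss: Dict[str, Dict[float, float]]) -> Dict[str, float]:
    """Expected loss (risk) of each decision under a belief about ability."""
    return {d: sum(belief[t] * loss[d][t] for t in belief) for d in loss}


# Prior: P(ability = -.8) = .4 and P(ability = +.8) = .6.
prior = {-0.8: 0.4, 0.8: 0.6}
# loss[d][ability]: d1 = classify above criterion, d2 = classify below.
loss = {
    "d1": {-0.8: 25.0, 0.8: 0.0},
    "d2": {-0.8: 0.0, 0.8: 15.0},
}

risks = expected_losses(prior, loss)
bayes_decision = min(risks, key=risks.get)   # decision with the lowest risk
print(risks, bayes_decision, risks[bayes_decision])
```

The decision with the smaller expected loss is the Bayes decision, and its expected loss is the Bayes risk.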

Although the proper decision has been determined for the case when
no observations have been taken, it has not been determined whether or
not an observation should be taken. To do that, the expected risk after
one observation plus cost must be compared to the Bayes risk without an
observation. Determining the expected risk after an observation requires
several steps, the first of which is determining the posterior
distribution of ability after an observation.

Suppose that an item of 0.0 difficulty is administered to a person
with ability .8 or -.8. Depending upon whether the response is correct
or incorrect, a Bayesian posterior can be determined using Bayes theorem:

$$P(\theta_i \mid x) = \frac{P(x \mid \theta_i)\,P(\theta_i)}{\sum_j P(x \mid \theta_j)\,P(\theta_j)} \qquad (16)$$

If a correct response is obtained to the item, the posterior probability
of a .8 ability is given by

$$P(.8 \mid 1) = \frac{P(1 \mid .8)\,P(.8)}{P(1 \mid .8)\,P(.8) + P(1 \mid -.8)\,P(-.8)}.$$

The probabilities of an ability of .8 or -.8 were given in the prior
distribution as .6 and .4 respectively. The probability of a correct
response, given the known ability, can be determined from the appropriate
ICC model. For example, using the one-parameter logistic model

$$P(1 \mid .8) = \frac{e^{(.8 - 0)}}{1 + e^{(.8 - 0)}} = .69,$$

while $P(1 \mid -.8) = .31$. The posterior probability of .8 is then
$P(.8 \mid 1) = .77$. Similarly, the posterior probability of -.8 is
$P(-.8 \mid 1) = .23$. The posterior probability of the .8 and -.8 abilities
given an incorrect response can likewise be determined using Equation 16.
The posterior probabilities given an incorrect response are
$P(.8 \mid 0) = (.31 \times .6)/.462 = .40$ and
$P(-.8 \mid 0) = (.69 \times .4)/.462 = .60$.

The next step is to determine the risk using the posterior distributions
just computed. If a correct response is obtained, the expected
loss for $d_1$ is .23 x 25 + .77 x 0 = 5.75. The expected loss for $d_2$ is
.77 x 15 + .23 x 0 = 11.55. Thus if a correct response is obtained, the
Bayes decision is $d_1$ with a Bayes risk of 5.75. If an incorrect response
is obtained, the expected loss for $d_1$ is .60 x 25 + .40 x 0 = 15.00, while
the expected loss for $d_2$ is .40 x 15 + .60 x 0 = 6.00. Thus, after an
incorrect response, $d_2$ is the Bayes decision with a Bayes risk of 6.00.

Since it is not known whether a correct or incorrect response will
be given, the expected risk regardless of the response must be computed.
To compute the overall expected risk, the probability of a correct and
an incorrect response is needed. The probability can be obtained using
the following formula:

$$P(1) = P(1 \mid .8)\,P(.8) + P(1 \mid -.8)\,P(-.8) = .69 \times .6 + .31 \times .4 = .538$$

$$P(0) = 1 - P(1) = .462.$$

The expected risk after a response can now be determined from

$$E(\text{risk} \mid \text{response}) = E(\text{loss} \mid 1)\,P(1) + E(\text{loss} \mid 0)\,P(0) = 5.75 \times .538 + 6.00 \times .462 = 5.87.$$

At this point, whether or not another observation should be taken
can be determined. If the expected loss after an observation plus cost
is greater than the risk before an observation, then administration of
items should cease. If the risk before an observation is taken is greater,
then another item should be administered. In the example given here,
assume the cost of a response is 1 unit. The expected loss after a response
plus cost is then 5.87 + 1 = 6.87. Since the Bayes risk with no items
administered was 9, another item should be administered. Depending on
the response to the item, decision $d_1$ or $d_2$ could be selected. After
the item is administered, the appropriate posterior becomes the new prior
and the process continues as above. A flowchart of the entire decision
process is presented in Figure 2 so that a more global picture of the
steps involved can be obtained.
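One cycle of the Bayesian procedure (posterior, risk after each possible response, expected risk, and the stopping comparison) can be sketched as follows. The helper names are hypothetical, and the example values (two ability levels, an item of difficulty 0.0 under the one-parameter logistic model, a cost of 1 unit) are those of the simplified example. The sketch carries full precision throughout, so it reproduces the probability of a correct response of about .538 and the posterior P(.8 | 1) of about .77, and it gives an expected risk of roughly 5.9 when no intermediate rounding is applied.

```python
import math
from typing import Dict, Tuple


def p_correct_1pl(theta: float, b: float) -> float:
    """One-parameter logistic probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def posterior(prior: Dict[float, float], p_item: Dict[float, float],
              x: int) -> Tuple[Dict[float, float], float]:
    """Bayes theorem for a 0/1 response x; also returns P(x)."""
    joint = {t: (p_item[t] if x == 1 else 1.0 - p_item[t]) * prior[t]
             for t in prior}
    marginal = sum(joint.values())
    return {t: v / marginal for t, v in joint.items()}, marginal


def bayes_risk(belief: Dict[float, float],
               loss: Dict[str, Dict[float, float]]) -> float:
    """Smallest expected loss over the available decisions."""
    return min(sum(belief[t] * loss[d][t] for t in belief) for d in loss)


prior = {-0.8: 0.4, 0.8: 0.6}
loss = {"d1": {-0.8: 25.0, 0.8: 0.0}, "d2": {-0.8: 0.0, 0.8: 15.0}}
cost = 1.0
p_item = {t: p_correct_1pl(t, 0.0) for t in prior}   # item difficulty 0.0

risk_now = bayes_risk(prior, loss)
risk_next = 0.0
for x in (1, 0):
    post, p_x = posterior(prior, p_item, x)
    risk_next += p_x * bayes_risk(post, loss)   # weight by P(response = x)

# Administer another item only if doing so is expected to lower the risk
# by more than the cost of the observation.
administer_another = risk_next + cost < risk_now
print(risk_now, round(risk_next, 2), administer_another)
```

After the item is actually administered, the posterior for the observed response would replace `prior` and the same computation would be repeated with the next item's difficulty.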

Insert Figure 2 about here

Although there are many positive factors in the use of the Bayesian
procedure, the very information that makes the control of the testing
situation more precise also makes it difficult to initially implement.
For example, specifying reasonable loss functions on the same metric as
the cost of an observation is difficult for most educational applications.
What is the cost of misclassifying persons below the criterion score when
they really should be classified above it? Some attempts have been made
by this author to specify loss functions for tailored testing applications,
but no satisfactory results have been obtained so far.

A second difficulty in the application of this procedure is in spec- 
ifying the prior distribution of achievement for a group. This is not 
as serious a problem as determining loss functions since performance data 
are usually available from previous groups. But of course, the more 
accurate the prior distribution, the more accurate the decision based 
on the procedure. 

It should be realized that the procedure presented here is a
simplification of a procedure that would be used for actual tailored testing
applications. Achievement levels are usually continuous, rather than
discrete as presented here, and the loss due to an incorrect decision
is a function of the person's distance from the criterion score rather
than a constant value. The procedure can also be modified by changing
the cost of observations with increasing test length to allow for fatigue
effects. Unfortunately, the Bayesian decision procedure as described
here has not yet been implemented in conjunction with an operating tailored
testing procedure. However, plans are being developed to evaluate an
operational version at the Tailored Testing Research Laboratory at the
University of Missouri.

Some Simulation Results for the SPRT

Before implementing the SPRT procedure described earlier in this paper,
information was desired on how the procedure functioned when items were
not randomly sampled from the item pool. Also, some experience was needed
in selecting the bounds of the indifference region, $\theta_0$ and $\theta_1$.
The effects of guessing on the accuracy of classification when the
one-parameter logistic model was used was another area of interest.

To determine the effects of these variables, the computation of the
SPRT was programmed into both the one- and three-parameter logistic tailored
testing procedures that were operational at the University of
Missouri-Columbia. These procedures have been described in detail previously
(Koch and Reckase, 1978) so they will be merely summarized here. The programs
implementing both models used a fixed stepsize method for branching through
an item pool until both a correct and an incorrect response had been given.
After that point, all ability estimates were obtained using an empirical
maximum likelihood estimation procedure. Items were selected for both
models to maximize the item information at the previous ability estimate.




To evaluate the decision making power of the SPRT, subjects with known
ability were needed. Therefore, a simulation routine was built into the
tailored testing program in place of the responding live examinee. At
the beginning of each simulation run, the true ability of the simulated
examinee was input into the program. This value was used to determine
the true probability of a correct response to the administered items based
on the model used (one- or three-parameter logistic) and the estimated
item parameters. A number was then randomly selected from a uniform
distribution on the range from 0 to 1. If the randomly selected number was
less than or equal to the probability of a correct response, the item
was scored as correct. If the randomly selected number was greater than
the probability of a correct response, the item was scored as incorrect.
This procedure continued for each item in the tailored test.
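The scoring rule for a simulated examinee can be sketched in a few lines. The function names, the seed, and the item parameters below are hypothetical; the three-parameter logistic form with D = 1.7 is assumed, and Python's `random.Random` stands in for whatever uniform generator the original program used.

```python
import math
import random


def p_correct(theta: float, a: float, b: float, c: float, d: float = 1.7) -> float:
    """Three-parameter logistic probability of a correct response.

    Setting a = 1/d and c = 0 gives the one-parameter logistic model.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-d * a * (theta - b)))


def simulate_response(theta: float, a: float, b: float, c: float,
                      rng: random.Random) -> int:
    """Score 1 if a uniform(0, 1) draw is <= the true probability, else 0."""
    return 1 if rng.random() <= p_correct(theta, a, b, c) else 0


# Hypothetical examinee (true ability .5) answering the same item many times;
# the observed proportion correct should approach the model probability.
rng = random.Random(1979)
responses = [simulate_response(0.5, 1.0, 0.0, 0.12, rng) for _ in range(20000)]
observed = sum(responses) / len(responses)
print(round(observed, 3), round(p_correct(0.5, 1.0, 0.0, 0.12), 3))
```

Using a separately seeded generator for each simulated administration, as in the research design, makes every run reproducible.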

Research Design

Tailored tests were simulated twenty-five times at each true ability
using different seed numbers for the random number generator. True abilities
from -3 to +3 at .25 intervals were used for both the one- and
three-parameter models to evaluate the performance of the SPRT. In addition,
simulations were run on a composite procedure in which the tailored test
procedure and the probability ratio calculations (Equation 11) were based on
the one-parameter model, but the item responses were determined using the
three-parameter model. This was done to determine the effects of guessing
on correct classification using the one-parameter logistic model.

In computing the probability ratios, three sets of limits of the
indifference regions were used: ±.3, ±.8, and ±1. A criterion of
$\theta_c = 0$ was assumed in all cases. The ratios were computed after each
item was administered and the results were compared to an A value of 45 and
a B value of .102. These were determined based on $\alpha = .02$ and
$\beta = .10$. A classification was made the first time these limits were
exceeded. If the limits were not exceeded before twenty items had been
administered (an arbitrary upper limit on test length), ratios above 1.0
were classified as above $\theta_c$, and ratios below 1.0 were classified as
below $\theta_c$. This is called a truncated SPRT. At each true ability used
for the simulation, the proportion of the 25 administrations classified
below $\theta_c$ and the average number of items administered were computed.
Plots of these values against the true abilities approximate the OC and ASN
functions, respectively. These plots were made for each combination of
indifference region and tailored testing method, yielding nine plots of the
OC and ASN functions.
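The truncated SPRT and the Monte Carlo approximation of one point on the OC and ASN curves can be sketched as below. This is a simplified illustration under stated assumptions, not the operational program: it uses the one-parameter logistic model only, and it replaces maximum-information item selection at the current ability estimate with a fixed rule (the items whose difficulties lie nearest the middle of the indifference region). The function names, the evenly spaced pool, and the seed are hypothetical.

```python
import math
import random
from typing import List, Tuple


def p_1pl(theta: float, b: float) -> float:
    """One-parameter logistic probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def truncated_sprt(true_theta: float, pool: List[float], theta0: float,
                   theta1: float, alpha: float, beta: float,
                   max_items: int, rng: random.Random) -> Tuple[str, int]:
    """Run one simulated test; return the classification and test length."""
    log_a = math.log((1.0 - beta) / alpha)
    log_b = math.log(beta / (1.0 - alpha))
    mid = (theta0 + theta1) / 2.0
    # Assumption: fixed item order, nearest to the criterion first.
    items = sorted(pool, key=lambda b: abs(b - mid))[:max_items]
    log_ratio, n = 0.0, 0
    for n, b in enumerate(items, start=1):
        x = 1 if rng.random() <= p_1pl(true_theta, b) else 0
        q0, q1 = p_1pl(theta0, b), p_1pl(theta1, b)
        log_ratio += math.log(q1 / q0) if x else math.log((1 - q1) / (1 - q0))
        if log_ratio >= log_a:
            return "above", n
        if log_ratio <= log_b:
            return "below", n
    # Truncation: classify by whether the ratio is above 1.0 (log ratio > 0).
    # A ratio of exactly 1.0 is treated as "below" in this sketch.
    return ("above" if log_ratio > 0.0 else "below"), n


def oc_asn_estimate(true_theta: float, half_width: float, pool: List[float],
                    n_reps: int = 25, seed: int = 0) -> Tuple[float, float]:
    """Proportion classified below the criterion and mean test length."""
    rng = random.Random(seed)
    results = [truncated_sprt(true_theta, pool, -half_width, half_width,
                              0.02, 0.10, 20, rng) for _ in range(n_reps)]
    prop_below = sum(1 for d, _ in results if d == "below") / n_reps
    mean_items = sum(n for _, n in results) / n_reps
    return prop_below, mean_items


pool = [-3.0 + i / 30.0 for i in range(181)]   # hypothetical evenly spaced pool
for theta in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(theta, oc_asn_estimate(theta, 0.8, pool))
```

Repeating the estimate over the grid of true abilities from -3 to +3 and plotting the two returned values against ability gives the simulated OC and ASN curves for one indifference region.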

Two different item pools were used for this study. For the analyses
using just the one-parameter or the three-parameter model, an existing
pool of 72 vocabulary items was used. This item pool had an approximately
normal distribution of difficulty parameters. For the one-parameter
tailored test using three-parameter responses, an item pool with 181
items, rectangularly distributed between -3 and [value illegible] on
difficulty, was used. These simulated items had constant discrimination
parameters of .588 (this value yields 1.0 when multiplied by D = 1.7)
and pseudo-guessing parameters of .12. This simulated item pool was
selected over the real vocabulary pool to have better control over the
guessing parameters. The one-parameter procedure used only the b-values
from the pool.
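
As a small check on the parameter values quoted above, the following
Python sketch evaluates the two response models for a simulated item.
The functional forms are an assumption (the usual logistic forms with
scaling constant D = 1.7); the function names are hypothetical and are
not taken from the original program.

```python
import math

D = 1.7  # scaling constant quoted in the text

def p_one_parameter(theta, b):
    """One-parameter logistic model with unit slope (assumed form)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def p_three_parameter(theta, a, b, c):
    """Three-parameter logistic model (assumed form): c is the
    pseudo-guessing parameter, a the discrimination, b the difficulty."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

if __name__ == "__main__":
    a, c = 0.588, 0.12
    print(D * a)                              # close to 1.0, as stated
    print(p_one_parameter(0.0, 0.0))          # 0.5 when ability equals difficulty
    print(p_three_parameter(0.0, a, 0.0, c))  # raised above 0.5 by guessing
```

With these values the two models share essentially the same slope, so
the simulated responses differ from the one-parameter model only through
the guessing parameter.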






Results 

The results of the simulation studies will be presented in three
parts: first the one-parameter SPRT, then the three-parameter SPRT, and
finally the results of the combined simulations. Plots of the OC and
ASN functions are presented to summarize the results of the SPRT for
these models.

One-parameter model

Figure 3 shows the OC functions for the one-parameter logistic model
based on the vocabulary item pool. The figure shows three graphs, one
for each of the ±.3, ±.8, and ±1 indifference regions. Note that the
curves are reasonably similar regardless of the indifference region.
The similarity indicates that in all three cases the classification
accuracy is nearly the same.

Insert Figure 3 about here 

The values of the curves at the limits of the indifference region
give further evaluative information. At the lower point, the OC function
should pass through 1 - α. At the -.3 value, the curve is in fact .85
when it should be .98, showing the degrading effects of restrictive
stopping rules used by the tailored testing procedure. At the -.8 and -1
points for the corresponding curves, the results are about as expected,
being .94 and 1.00 rather than .98.

At the upper limit of the indifference region the OC function should
have a value of .1. For the .3 case it is in fact .5 rather than .1,
again showing the effects of truncating the procedure. At the values of
.8 and 1, the values of the OC function were near or better than what they
should have been based on the theoretically expected results.

The ASN functions for the one-parameter model are given in Figure
4. The curves plotted correspond to the ASN functions using indifference
regions of ±.3, ±.8, and ±1. It can immediately be seen from the graph
that there is a substantial difference in the average number of items
needed to reach a decision, with the greatest number required when the
indifference region is narrowest. It can also be seen that the largest
expected number of items is near the criterion score of 0.0 and that the
average number drops off at the extreme abilities. The slight lack of
symmetry in the curves is due to the fact that α was not equal to β.
For abilities beyond ±1, an average of only about 3 to 5 items was needed
for classification for the wider regions, while 6 to 11 were needed for
the ±.3 indifference region. Note that the ±.3 curve is approaching the
arbitrary twenty item limit for the tailored tests.

Insert Figure 4 about here 

Figure 5 shows the theoretical curves for the ASN and OC functions
based on the ±.3 indifference region for comparison purposes. An infinite
number of items with difficulty 0.0 was assumed for the theoretical
functions, and the tests were assumed to have no upper limit on the number
of items administered. A comparison of Figures 3 and 4 with Figure 5
shows that the OC curve for the theoretical function is steeper at the
cutting point than the simulated curves, and the ASN function is
substantially higher. The difference in the theoretical and simulated OC
curves shows the effect of the 20 item stopping rule and the selection of
items of differing difficulty.
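
For readers who wish to reproduce theoretical curves of this kind, the
sketch below evaluates Wald's (1947) standard approximations to the OC
and ASN functions for identical items. It is offered as an illustration
only: the function names are hypothetical, the one-parameter logistic
model with unit slope is assumed, items are taken to have difficulty 0.0,
and the curves are traced through Wald's auxiliary parameter h (h = 1
corresponds to the lower limit of the indifference region and h = -1 to
the upper limit; h = 0 must be avoided).

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def wald_oc_asn(h, p0, p1, alpha, beta):
    """Wald's approximations for a test of p0 against p1 (p0 < p1).

    Returns (p, oc, asn): the true probability of a correct response
    indexed by h, the probability of classifying below the criterion,
    and the approximate average number of items. Requires h != 0.
    """
    A = (1.0 - beta) / alpha
    B = beta / (1.0 - alpha)
    up = p1 / p0                      # ratio factor for a correct response
    down = (1.0 - p1) / (1.0 - p0)    # ratio factor for an incorrect response
    p = (1.0 - down ** h) / (up ** h - down ** h)
    oc = (A ** h - 1.0) / (A ** h - B ** h)
    mean_step = p * math.log(up) + (1.0 - p) * math.log(down)
    asn = (oc * math.log(B) + (1.0 - oc) * math.log(A)) / mean_step
    return p, oc, asn

if __name__ == "__main__":
    p0, p1 = logistic(-0.3), logistic(0.3)   # limits of the ±.3 region
    for h in (2.0, 1.0, 0.5, -0.5, -1.0, -2.0):
        p, oc, asn = wald_oc_asn(h, p0, p1, 0.02, 0.10)
        theta = math.log(p / (1.0 - p))      # ability giving this p when b = 0
        print(round(theta, 2), round(oc, 3), round(asn, 1))
```

At h = 1 and h = -1 the OC values equal 1 - α and β, which are the
values the simulated curves are compared against at the limits of the
indifference region.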

Insert Figure 5 about here

Three-parameter model

The results of the simulation of the three-parameter logistic tailored
test are given in Figures 6 and 7. Figure 6 presents the OC functions
for the three-parameter model, again using the indifference regions of
±.3, ±.8, and ±1. Notice that, as with the one-parameter model, the OC
curves are fairly similar for the three indifference regions throughout
most of the range of ability. However, there are discrepancies for the
±1.0 indifference range curve near the +1 and -1 points, indicating a
decline in decision precision for that region. At the -.3 value for the
±.3 indifference range, the value of the curve is .96, fairly close to
the .98 theoretical value. At the upper end (.3), however, the value is
.2 instead of the .1 value that it should be. This may show the effects
of guessing on the decision process. The ±.8 and ±1 indifference regions
again yield better error probabilities than would be expected from the
theory.

The ASN function for the three-parameter model (Figure 7) also shows
similar results to those obtained from the one-parameter model. The ±.3
indifference region required the greatest number of items, while ±.8 and
±1.0 required about the same number. As before, the largest number was
required near the criterion score. However, with the three-parameter
model, far fewer items on the average were required to make a decision
than for the one-parameter model. Of special note is the ASN value of
about 1.0 in the -1 to -3 range on the ability scale. Decisions seem
to be possible with very few items in that range.

Insert Figures 6 and 7 about here

Because of the guessing component of the three-parameter logistic
model, the ASN function tended to yield more asymmetric results than the
one-parameter model. More items were required when classifying high than
when classifying low, to compensate for the non-zero probability of a
correct response from guessing. Also, the ASN curve for the ±.3
indifference region was much more peaked than its one-parameter
counterpart. If the simulated curves for the three-parameter model are
compared to the theoretical curves presented in Figure 5, the OC functions
can be seen to match the theoretical functions fairly closely, while the
ASN functions show that substantially fewer items were required. Over much
of the ability range, as many as ten times more items were specified by
the theoretical ASN curve when unlimited identical items were assumed.
However, it should be noted that the theoretical curves are based on the
one-parameter model.


Effect of guessing on the one-parameter model

Figure 8 shows the OC functions for the one-parameter model when
the three-parameter model was used to determine the responses. The figure
shows three graphs, one for each of the ±.3, ±.8, and ±1 indifference
regions. Note that the curves are fairly similar regardless of the
indifference region, but that they are shifted substantially to the left
compared to the previous OC curves. This indicates that the probability of
classifying a person below θc has dropped off substantially until an
ability of about -2 has been reached. In other words, it is much easier to
be classified above the criterion score using this procedure than when
guessing does not enter into the decision. The effective criterion has
been shifted down to -1.5 instead of being at zero. Clearly the values of
the OC function at the limits of the indifference region are entirely
different from the theoretical values.
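
The direction of this shift can be illustrated with a simple calculation
for a single item. The Python sketch below computes the expected change
in the log probability ratio per item when the ratio is computed from the
one-parameter model but the responses include guessing. It is a
simplified illustration under stated assumptions (hypothetical function
names, unit slope, criterion at 0, one item of fixed difficulty) and does
not attempt to reproduce the size of the shift observed in the tailored
test simulation, where item difficulties vary from item to item.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def expected_log_ratio_step(theta, b, half_width, c=0.0):
    """Expected per-item change in the log probability ratio.

    The ratio uses the one-parameter model at the limits -half_width and
    +half_width; the examinee's true probability of a correct response
    includes a guessing level c (c = 0 gives the one-parameter case).
    """
    p1 = logistic(half_width - b)     # model probability at the upper limit
    p0 = logistic(-half_width - b)    # model probability at the lower limit
    p_true = c + (1.0 - c) * logistic(theta - b)
    return (p_true * math.log(p1 / p0)
            + (1.0 - p_true) * math.log((1.0 - p1) / (1.0 - p0)))

if __name__ == "__main__":
    # Examinee exactly at the criterion, item of difficulty 0, ±.3 region.
    print(expected_log_ratio_step(0.0, 0.0, 0.3, c=0.0))   # no drift
    print(expected_log_ratio_step(0.0, 0.0, 0.3, c=0.12))  # upward drift
```

A positive expected step means the ratio tends to move toward the upper
limit, that is, toward a classification above the criterion, even for an
examinee whose true ability is at the criterion.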

Insert Figure 8 about here

The ASN functions for the three indifference regions, ±.3, ±.8, and
±1, are shown in Figure 9. The differences between these graphs and those
presented in Figure 4 are that the curves are higher (more items are
required) and the highest point of the curve is shifted over to the
steepest part of the OC curve. The relationship between the height of
the ASN function and the width of the indifference region still holds,
however: as the region gets wider, the average number of items decreases.

Insert Figure 9 about here 

Summary and Conclusions

The purpose of this paper has been to describe two procedures for making
binary classification decisions using tailored testing, the sequential
probability ratio test (SPRT) and a Bayesian decision procedure, and to
present some simulation data showing the characteristics of the operation
of the SPRT for two item characteristic curve models. The first
procedure described, the SPRT, was developed by Wald for quality control
work. It has not been widely applied for testing applications because the
assumption of an equal probability of a correct response was made to
facilitate the derivation of the operating characteristic (OC) and average
sample number (ASN) functions. Since this assumption can only be met for
testing applications by randomly sampling items for administration, the
procedure has not been used with tailored testing. In this paper, the
probability of a correct response was allowed to vary from item to item,
although it made the derivation of the OC and ASN functions impossible.
Simulation procedures were then used to estimate these functions.

The SPRT procedure described is operational at the Tailored Testing
Research Laboratory of the University of Missouri-Columbia in two forms:
a live tailored testing procedure, and a simulated procedure. The results
of the application of the simulation procedure to three studies were
described in this paper. The first study estimated the OC and ASN
functions for a one-parameter logistic based tailored testing procedure in
which the size of the indifference region around the criterion score was
varied. The results of the study showed that the average number of items
needed for classification was quite low when the true ability of a
simulated person was not too close to the criterion score, and that the
width of the indifference region did not greatly affect the OC function.
The width of the indifference region did have a substantial effect on the
ASN function. The accuracy of classification of the simulated tailored
test was not quite as good as administering a large number of items with
difficulty values equal to the criterion score. This result was explained
by the arbitrary 20 item limit imposed on the tailored test and the
variation in the difficulty parameters of the items administered.




The second study estimated the OC and ASN functions for a three-parameter
logistic tailored testing procedure, also varying the size of the
indifference region. The results were similar to those for the
one-parameter model, but even fewer items were generally needed for
classification. The results of these first two studies both indicated that
the SPRT could be successfully applied to tailored testing.

The third simulation study estimated the OC and ASN functions for
the one-parameter model when guessing was allowed to enter into the
responses to the items administered. The results showed that guessing in
effect lowered the criterion score, making it easier to classify an
examinee above the criterion, and raising the average number of items
needed for classification. This spurious shift in the criterion greatly
increased the error rates in classification. The effect is strong enough
to preclude the use of the one-parameter model for classification
decisions when guessing is a factor.

The second decision procedure described in this paper allows the use
of a greater amount of information in making a decision than the SPRT.
The Bayesian procedure includes a prior distribution of student
achievement, a loss function for incorrect decisions, and the cost of
observations in the development of the decision rule. The basic philosophy
of this procedure is to administer items until the expected loss incurred
in making a decision is less than the expected loss after the next item is
administered plus the cost of administration. At that point a decision is
made that minimizes the expected loss. The Bayesian procedure is described
in detail and a simple example is given of its use. The Bayesian
procedure is not yet operational for making decisions under tailored
testing because appropriate loss functions for educational decisions have
not been determined. However, simulation studies of the procedure will
commence in the near future.
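
The one-item-ahead stopping rule summarized above can be sketched as
follows. This is a minimal illustration under stated assumptions, not the
procedure as it would be implemented operationally: the prior is taken to
be a discrete distribution over a few ability values, the loss table and
item cost are arbitrary illustrative numbers, the one-parameter logistic
model is assumed, and all function names are hypothetical.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def bayes_risk(weights, losses):
    """Smallest expected loss over the available decisions.

    losses[d][k] is the loss of decision d when the ability is the k-th
    value of the discrete distribution with probabilities `weights`.
    """
    return min(sum(w * l for w, l in zip(weights, row)) for row in losses)

def posterior(weights, likelihoods):
    """Posterior weights and the marginal probability of the response."""
    joint = [w * l for w, l in zip(weights, likelihoods)]
    total = sum(joint)
    return [j / total for j in joint], total

def should_administer(thetas, prior, losses, b, cost):
    """Compare the loss before an observation with the expected loss
    after one more item of difficulty b plus the cost of the item."""
    p_right = [logistic(t - b) for t in thetas]
    p_wrong = [1.0 - p for p in p_right]
    post_right, prob_right = posterior(prior, p_right)
    post_wrong, prob_wrong = posterior(prior, p_wrong)
    risk_now = bayes_risk(prior, losses)
    risk_after = (prob_right * bayes_risk(post_right, losses)
                  + prob_wrong * bayes_risk(post_wrong, losses))
    return risk_now > risk_after + cost, risk_now, risk_after

if __name__ == "__main__":
    thetas = [-1.0, 1.0]            # illustrative ability values
    prior = [0.5, 0.5]
    losses = [[0.0, 1.0],           # decision "below criterion"
              [1.0, 0.0]]           # decision "above criterion"
    print(should_administer(thetas, prior, losses, b=0.0, cost=0.01))
    print(should_administer(thetas, prior, losses, b=0.0, cost=0.30))
```

When the function indicates that another item should be given, the
posterior matching the observed response would become the new prior and
the comparison would be repeated; otherwise testing stops and the
decision with the smallest expected loss is made.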

Both of the decision procedures described in this paper show promise
for use in tailored testing. Both also require substantial research effort
before they can be applied with confidence. It is hoped that this paper
will help to stimulate that research.



References 

Betz, N. E. and Weiss, D. J. An empirical study of computer-administered
two-stage ability testing. (Research Report 73-4). Psychometric
Methods Program, University of Minnesota, Minneapolis, 1973.

Brunk, H. D. An introduction to mathematical statistics (2nd Ed.).
New York: Blaisdell, 1965.

DeGroot, M. Optimal statistical decisions. New York, McGraw-Hill, 1970. 

Dodge, H. F. and Romig, H. G. A method of sampling inspection. Bell
System Technical Journal, 1929, 8, 613-631.

Epstein, K. Applications of sequential testing procedures to performance 
testing. Proceedings of the 1977 Computerized Adaptive Testing 
Conference. University of Minnesota, Minneapolis, Minn.: July, 
1978. 

Koch, W. R. and Reckase, M. D. A live tailored testing comparison study
of the one- and three-parameter logistic models. (Research Report
78-1). University of Missouri, Columbia, MO: Tailored Testing
Research Laboratory, June, 1978.

Lord, F. M. Some test theory for tailored testing. In W. H. Holtzman 
(Ed.), Computer-assisted instruction, testing and guidance . New York: 
Harper and Row, 1970. 

Owen, R. J. A Bayesian approach to tailored testing. Princeton, N. J.:
Educational Testing Service, Research Bulletin RB-69-92, 1969. 

Reckase, M. D. A generalization of sequential analysis to decision making
with tailored testing. Paper presented at the meeting of the Military
Testing Association, Oklahoma City, November, 1978.

Reckase, M. D. An interactive computer program for tailored testing based
on the one-parameter logistic model. Behavior Research Methods and
Instrumentation, 1974, 6, 208-212.

Sixtl, F. Statistical foundations for a fully automated examiner.
Zeitschrift für Entwicklungspsychologie und Pädagogische Psychologie,
1974, 6(1), 28-38.

Wald, A. Sequential analysis. New York: Wiley, 1947.

Weiss, D. J. Presentation at the ONR Contractors meeting. University
of Missouri, Columbia, MO: September, 1978.

Weiss, D. J. Strategies of adaptive ability measurement. (Research Report
74-5). Psychometric Methods Program, University of Minnesota,
Minneapolis, December, 1974.



FIGURE 2
FLOWCHART OF BAYESIAN DECISION PROCESS

[Flowchart not reproduced. Box labels recovered from the figure:]
COMPUTE EXPECTED LOSS BEFORE AN OBSERVATION FOR EACH DECISION
SELECT BAYES DECISION AND BAYES RISK WITHOUT OBSERVATION
COMPUTE POSTERIOR ASSUMING CORRECT RESPONSE
COMPUTE EXPECTED LOSS ASSUMING CORRECT RESPONSE
SELECT BAYES DECISION AND BAYES RISK
COMPUTE POSTERIOR ASSUMING INCORRECT RESPONSE
COMPUTE EXPECTED LOSS ASSUMING INCORRECT RESPONSE
SELECT BAYES DECISION AND BAYES RISK
COMPUTE PROBABILITIES OF EACH RESPONSE
COMPUTE EXPECTED LOSS AFTER RESPONSE
IS LOSS BEFORE OBSERVATION GREATER THAN LOSS AFTER OBSERVATION PLUS COST?
YES: SELECT APPROPRIATE POSTERIOR AS NEW PRIOR
NO: STOP AND MAKE A DECISION

FIGURE 3
ONE-PARAMETER OC FUNCTIONS
FOR THREE INDIFFERENCE REGIONS

[Plot not reproduced. Legend: indifference regions ±.3, ±.8, ±1.0.
Horizontal axis: ACHIEVEMENT (θ).]



FIGURE 4
ONE-PARAMETER ASN FUNCTIONS
FOR THREE INDIFFERENCE REGIONS

[Plot not reproduced. Legend: indifference region.
Horizontal axis: ACHIEVEMENT (θ).]



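[Figure title only partly legible:]
... FOR THREE INDIFFERENCE REGIONS
[Plot not reproduced. Legend: indifference region ±.3.]

[Figure number not legible:]
COMPOSITE OC FUNCTIONS
FOR THREE INDIFFERENCE REGIONS

FIGURE 9
COMPOSITE ASN FUNCTIONS
FOR THREE INDIFFERENCE REGIONS

[Plot not reproduced. Legend: indifference regions ±.8, ±1.0.]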



